Project 5 - Unsupervised Learning

Problem Statement: Credit Card Customer Segmentation

Background

AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing ones. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.

Objective

To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank.

Steps and Tasks:

  1. Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea of the number of clusters. Perform EDA and create visualizations to explore the data. (10 marks)
  2. Properly comment the code, explain the steps taken in the notebook, and conclude your insights from the graphs. (5 marks)
  3. Execute K-means clustering using an elbow plot and analyse the clusters using box plots. (10 marks)
  4. Execute hierarchical clustering (with different linkages) with the help of a dendrogram and the cophenetic coefficient. Analyse the clusters formed using box plots. (15 marks)
  5. Calculate the average silhouette score for both methods. (5 marks)
  6. Compare the K-means clusters with the hierarchical clusters. (5 marks)
  7. Analyse the clusters formed, describe how one cluster differs from another, and answer all the key questions. (10 marks)
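Step 4 asks for a comparison of linkage methods via the cophenetic correlation coefficient. As a hedged sketch of that comparison (synthetic data is used as a stand-in here, since the real customer data is only loaded further down in the notebook):

```python
# Sketch of step 4's linkage comparison via the cophenetic correlation
# coefficient. The random matrix X is a placeholder for the scaled data.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))  # placeholder for the scaled customer frame

dist = pdist(X)  # condensed pairwise Euclidean distances
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    # coeff measures how faithfully the dendrogram preserves the distances
    coeff, _ = cophenet(Z, dist)
    print(f"{method:>8}: cophenetic coefficient = {coeff:.3f}")
```

A higher coefficient means the dendrogram's merge heights better preserve the original pairwise distances, which is one criterion for picking a linkage.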

Attribute Information:

The data covers various customers of a bank: their credit limit, the total number of credit cards they hold, and the different channels through which they have contacted the bank with queries. These channels include visiting a branch, going online, and calling.

Input variables:

  • Customer key - Identifier for the customer
  • Average Credit Limit - Average credit limit across all the credit cards
  • Total credit cards - Total number of credit cards
  • Total visits bank - Total number of bank visits
  • Total visits online - Total number of online visits
  • Total calls made - Total number of calls made by the customer

Key Questions

  1. How many different segments of customers are there?
  2. How are these segments different from each other?
  3. What are your recommendations to the bank on how to better market to and service these customers?

For this project I will use the experience from the previous project as a template. This allows me to move faster through the initial part of the project (EDA and some feature engineering). After that I will apply K-means and hierarchical clustering. Hence most of the initial steps (plots, graphics, functions, etc.) will be similar to those used in previous projects.

As seen in the mentored session, applying a template is a good strategy to be more efficient and get better results.

For this project I will apply the same structural steps as in Project 4, that is, keeping the plotting sections separate in order to move easily through the document.

Import Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
In [2]:
import pandas as pd #Read files
import numpy as np # numerical libraries


# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline 

# Import libraries to work on K-means
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import Image  
from os import system
In [14]:
pd.options.display.float_format = '{:,.2f}'.format
In [179]:
# Below we will read the data from the local folder
df = pd.read_excel('Credit Card Customer Data.xlsx')

# Now display the header 
print ('Credit Card Customer Data.xlsx data set:')
df.head(5)
Credit Card Customer Data.xlsx data set:
Out[179]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
In [180]:
# "Sl_No" looks like the index column of the data, so rather than dropping it I will reload the file using it as the index.

df = pd.read_excel('Credit Card Customer Data.xlsx', index_col= 'Sl_No')

# And now will display the header again
print ('Credit Card Customer Data.xlsx data set:')
df.head(10)
Credit Card Customer Data.xlsx data set:
Out[180]:
Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
1 87073 100000 2 1 1 0
2 38414 50000 3 0 10 9
3 17341 50000 7 1 3 4
4 40496 30000 5 1 1 4
5 47437 100000 6 0 12 3
6 58634 20000 3 0 1 8
7 48370 100000 5 0 11 2
8 37376 15000 3 0 1 1
9 82490 5000 2 0 2 2
10 44770 3000 4 0 1 7
In [56]:
df.tail() # to see what the end of the data looks like
Out[56]:
Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
656 51108 99000 10 1 10 0
657 60732 84000 10 1 13 2
658 53834 145000 8 1 9 1
659 80655 172000 10 1 15 0
660 80150 167000 9 0 12 2
  • ## 1.1 Univariate analysis
In [68]:
df.info() # here we will see the number of entries (rows), columns, dtypes, and non-null counts
<class 'pandas.core.frame.DataFrame'>
Int64Index: 660 entries, 1 to 660
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Customer Key         660 non-null    int64
 1   Avg_Credit_Limit     660 non-null    int64
 2   Total_Credit_Cards   660 non-null    int64
 3   Total_visits_bank    660 non-null    int64
 4   Total_visits_online  660 non-null    int64
 5   Total_calls_made     660 non-null    int64
dtypes: int64(6)
memory usage: 36.1 KB
In [69]:
print(f"The given dataset contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The given dataset contains {df.isna().sum().sum()} Null value")
The given dataset contains 660 rows and 6 columns
The given dataset contains 0 Null value
In [80]:
neg_exp = df[df.lt(0)] # mask that keeps only negative values (NaN elsewhere)
print("the number of negative entries is", sum(n < 0 for n in df.values.flatten()))
# the output might be taken into consideration later on in the calculations.
the number of negative entries is 0

Insight # 1:

  • There are no null values, i.e., no missing values: every row has a value for every column.
  • There are no negative numbers in the dataframe.
  • I will drop the columns we don't need at the end of the EDA; those are likely Sl_No and Customer Key.
In [82]:
df.shape # size of the data set (# rows or entries, # columns or variables)
Out[82]:
(660, 6)
In [76]:
df.describe().transpose() # transposed to make the attributes easier to read
Out[76]:
count mean std min 25% 50% 75% max
Customer Key 660.000 55,141.444 25,627.772 11,265.000 33,825.250 53,874.500 77,202.500 99,843.000
Avg_Credit_Limit 660.000 34,574.242 37,625.488 3,000.000 10,000.000 18,000.000 48,000.000 200,000.000
Total_Credit_Cards 660.000 4.706 2.168 1.000 3.000 5.000 6.000 10.000
Total_visits_bank 660.000 2.403 1.632 0.000 1.000 2.000 4.000 5.000
Total_visits_online 660.000 2.606 2.936 0.000 1.000 2.000 4.000 15.000
Total_calls_made 660.000 3.583 2.865 0.000 1.000 3.000 5.000 10.000
In [77]:
df.nunique() # number of unique values per column
# this helps identify categorical variables and gives an idea of possible groups or clusters
Out[77]:
Customer Key           655
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
In [28]:
# Since "Customer Key" has almost as many unique values as the total number of inputs (660), probably some customers have more than one entry.
# This means that either the entries are duplicated, or they come from different periods of time or bank branches.
# I will try to confirm this with the next line
pd.value_counts(df['Customer Key']) 
Out[28]:
47437    2
37252    2
97935    2
96929    2
50706    2
        ..
66706    1
72339    1
69965    1
85645    1
71681    1
Name: Customer Key, Length: 655, dtype: int64
In [99]:
repeated_cust = (47437, 37252, 97935, 96929, 50706) ## list of duplicate customers taken from the result above
df.loc[df['Customer Key'].isin(repeated_cust)] # this command displays the duplicate data
Out[99]:
Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
5 47437 100000 6 0 12 3
49 37252 6000 4 0 2 8
105 97935 17000 2 1 2 10
333 47437 17000 7 3 1 0
392 96929 13000 4 5 0 0
399 96929 67000 6 2 2 2
412 50706 44000 4 5 0 2
433 37252 59000 6 2 1 2
542 50706 60000 7 5 2 2
633 97935 187000 7 1 7 0

Insight # 2:

  • Based on the results above, most of the variables behave like categoricals, which suggests we could group them, for example into 6 or 11 groups.
  • There are 5 customers with 2 entries each (since there are 655 customer keys and 660 serial numbers). Although the customer keys are duplicated, the values are not the same, as seen in the table above. I will keep the data as it is and treat them as separate inputs, because 5 out of 660 entries will not affect the results.
  • There are 110 different credit limits with a wide range from min to max. The table above shows outliers for this variable, given the gap between the 75th percentile and the max value of Avg_Credit_Limit.
  • The variable "visits online" also seems to have outliers, for the same reason as the average credit limit.
  • The data need to be scaled in order to compare the variables, especially Avg_Credit_Limit, which has much bigger numbers than the rest.
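Rather than hard-coding the repeated keys, the duplicates could also be pulled out programmatically; a minimal sketch (a small stand-in frame `df_demo` is used here, since the real `df` is loaded from the Excel file above):

```python
import pandas as pd

# Stand-in frame with the same column name as the real data
df_demo = pd.DataFrame({
    'Customer Key':     [47437, 37252, 47437, 50706, 50706],
    'Avg_Credit_Limit': [100000, 6000, 17000, 44000, 60000],
})

# keep=False flags every occurrence of a repeated key, not just the later ones
dupes = df_demo[df_demo['Customer Key'].duplicated(keep=False)]
print(dupes.sort_values('Customer Key'))
```

The same two lines applied to the real `df` would list all ten duplicate rows without manually copying keys from the `value_counts` output.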
In [84]:
pd.options.display.float_format = '{:,.3f}'.format # to see 3 decimals in the output of the cell below
In [85]:
 # Now we will get a list of unique values to evaluate how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    
    # if number of unique values is less than 30, print the values. Otherwise print the number of unique values
    if len(n)<30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' +str(len(n)) + ' unique values')
        print()
Customer Key: 655 unique values

Avg_Credit_Limit: 110 unique values

Total_Credit_Cards: 
4    0.229
6    0.177
7    0.153
5    0.112
2    0.097
1    0.089
3    0.080
10   0.029
9    0.017
8    0.017
Name: Total_Credit_Cards, dtype: float64

Total_visits_bank: 
2   0.239
1   0.170
3   0.152
0   0.152
5   0.148
4   0.139
Name: Total_visits_bank, dtype: float64

Total_visits_online: 
2    0.286
0    0.218
1    0.165
4    0.105
5    0.082
3    0.067
15   0.015
7    0.011
12   0.009
10   0.009
8    0.009
13   0.008
11   0.008
9    0.006
14   0.002
6    0.002
Name: Total_visits_online, dtype: float64

Total_calls_made: 
4    0.164
0    0.147
2    0.138
1    0.136
3    0.126
6    0.059
7    0.053
9    0.048
8    0.045
5    0.044
10   0.039
Name: Total_calls_made, dtype: float64

Insight # 3:

  • The most frequent (mode) value of both total visits online and visits to the bank is 2.
  • Total visits online has a wider range of entries. It will be interesting to see the difference between bank visits and online visits per customer; this will be evaluated later in the bivariate analysis.
  • The mode of calls made and the mode of the number of credit cards are the same: 4.
  • ### Scaling the data: as mentioned in Insight 2
In [181]:
# I will scale the data using z-scores in order to:
## - compare the variables with each other, and
## - be able to apply clustering methods.

interest_df = df.drop(['Customer Key'], axis=1) # drop the customer key since it adds no value to the output
In [182]:
from scipy.stats import zscore
interest_df_z = interest_df.apply(zscore)
In [183]:
interest_df_z.head()
Out[183]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
1 1.740 -1.249 -0.860 -0.547 -1.252
2 0.410 -0.788 -1.474 2.521 1.892
3 0.410 1.059 -0.860 0.134 0.146
4 -0.122 0.136 -0.860 -0.547 0.146
5 1.740 0.597 -1.474 3.202 -0.204
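A quick sanity check on the z-scoring: each scaled column should end up with mean ≈ 0 and (population) standard deviation ≈ 1. A toy frame is used below as a stand-in for `interest_df`:

```python
import pandas as pd
from scipy.stats import zscore

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, 40.0, 30.0]})
toy_z = toy.apply(zscore)

# scipy's zscore defaults to ddof=0, so compare against the population std
print(toy_z.mean().round(6))       # ~0 for every column
print(toy_z.std(ddof=0).round(6))  # ~1 for every column
```

Running the same check on `interest_df_z` confirms the variables are now on a comparable scale before clustering.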
    • ### Box plots
In [106]:
plt.subplots(figsize=(15,10))
ax = sns.boxplot(data=interest_df_z)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);

Insight # 4:

  • As expected (Insight 2), Avg_Credit_Limit shows outliers; fewer outliers are also present in Total_visits_online.
  • These box plots will be evaluated with histograms below.
    • ### Histograms

Here I will check each variable to evaluate the body and tails of its distribution.

In [116]:
interest_df_z.hist(stacked=False, bins=100, figsize=(30,30), layout=(2,3)); 
# Histogram will show graphically what was seen in the unique values above.
In [118]:
## Please note: I found this code on the internet; it allows a better and faster visualization of multiple distribution plots,
## and was used in the previous project.
#### This code also shows the mean of each variable, which is better than the histogram plotted above.

##### This gives an initial idea of possible groups, which will be revisited in the bivariate analysis below.

import itertools 
import statistics 

cols = [i for i in interest_df.columns]

fig = plt.figure(figsize=(20, 25))

for i,j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(5,2,j+1)
    ax = sns.distplot(df[i],color='blue',rug=True)
    plt.axvline(df[i].mean(),linestyle="dashed",label="mean", color='black')
    plt.axvline(statistics.mode(df[i]),linestyle="dashed",label="Mode", color='Red')
    plt.axvline(statistics.median(df[i]),linestyle="dashed",label="MEDIAN", color='Green')
    plt.legend()
    plt.title(i)
    plt.xlabel("")

Insight # 5:

  • From the KDE plots we can see the outliers mentioned in Insight 2, and strong positive skewness in Avg_Credit_Limit (where the median is greater than the mode) and in Total_visits_online (where the mode and median are almost the same). Total_calls_made is slightly skewed, with the mode greater than the mean and median.
  • For each variable we can also see a mixture of at least 2 Gaussians; for instance, in Total_Credit_Cards we can see 4. This suggests we could use the number of credit cards to create groups.
  • ## 1.2 Bi-variate analysis

Here I will check the correlation between variables

In [120]:
interest_df_z.corr() # with this function will try to see a correlation between variables numerically 
Out[120]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Avg_Credit_Limit 1.000 0.609 -0.100 0.551 -0.414
Total_Credit_Cards 0.609 1.000 0.316 0.168 -0.651
Total_visits_bank -0.100 0.316 1.000 -0.552 -0.506
Total_visits_online 0.551 0.168 -0.552 1.000 0.127
Total_calls_made -0.414 -0.651 -0.506 0.127 1.000

Insight # 6:

  • Avg_Credit_Limit has its highest correlations (>= 0.55) with Total_Credit_Cards and Total_visits_online, and a negative correlation of -0.41 with Total_calls_made.
  • Total_Credit_Cards has a correlation of 0.32 with Total_visits_bank and a strong negative correlation with Total_calls_made (-0.65).
  • Total_visits_bank shows sizeable negative correlations (> 0.5 in magnitude) with Total_calls_made and Total_visits_online.
  • The weakest correlations are Total_visits_bank vs Avg_Credit_Limit (-0.10), Total_calls_made vs Total_visits_online (0.13), and Total_Credit_Cards vs Total_visits_online (0.17).
    • ### Pair Plots
In [121]:
g = sns.PairGrid(interest_df_z)
g.map_upper(plt.scatter)
g.map_lower(sns.lineplot)
g.map_diag(sns.kdeplot, lw=3, legend=True);
In [123]:
sns.pairplot(interest_df_z , hue='Total_visits_bank' , diag_kind = 'kde')
# using Total_visits_bank because it has the fewest unique values
plt.show()

Insight # 7

  • In the plots above we again see between 2 and 4 groups (Gaussians); this will be evaluated later in the clustering.
    • ### Heatmap
In [126]:
# Another view of the correlations
plt.figure(figsize=(10,10))
mask = np.zeros_like(interest_df_z.corr())  # mask the upper triangle to avoid duplicated cells
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(interest_df_z.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cmap="YlGnBu",
            mask= mask,
           )
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);
plt.show()

Insight # 8

  • We confirm here what was explained in the insight 6
  • ### I will run pandas-profiling just to confirm what was mentioned in the previous lines. This code was seen in the latest mentor session, after all the exploratory analysis was done, but I want to confirm the findings so it can be used in future projects to save time. This step could also be adopted as common practice.
In [7]:
#pip install pandas-profiling[notebook] 
[pip install output trimmed]
Successfully installed confuse-1.4.0 htmlmin-0.1.12 imagehash-4.2.0 missingno-0.4.2 pandas-profiling-2.9.0 phik-0.10.0 tangled-up-in-unicode-0.0.6 visions-0.5.0
In [127]:
from pandas_profiling import ProfileReport

profile = ProfileReport(interest_df_z )

profile



Out[127]:

Insight # 9

  • The main takeaway of this step is that in future projects I will apply it at the beginning of the study. It gives an idea of where to look for more detail, such as relevant correlations and the more important variables.
  • Most of the observations in this report about the data were already mentioned.
  • ## 3.0 Get the data model ready

This step was done above, when the Customer Key column was dropped to create the data frame interest_df.

The data was also scaled, creating the dataframe interest_df_z, which will be used in the following sections.

  • ## 3.1 Elbow plot
In [185]:
# Finding the optimal number of clusters using the data frame before scaling
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(interest_df)
    prediction=model.predict(interest_df)
    meanDistortions.append(sum(np.min(cdist(interest_df, model.cluster_centers_, 'euclidean'), axis=1)) / interest_df.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Out[185]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
In [251]:
# Finding the optimal number of clusters using the scaled dataframe
## This was done to see if there was a big difference in the graphics; in both cases n_clusters = 3 is obtained.
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model2=KMeans(n_clusters=k)
    model2.fit(interest_df_z)
    prediction=model2.predict(interest_df_z)
    meanDistortions.append(sum(np.min(cdist(interest_df_z, model2.cluster_centers_, 'euclidean'), axis=1)) / interest_df_z.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method using scaled dataframe')
Out[251]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method using scaled dataframe')
In [192]:
#Set the value of k=3
model3 = KMeans(n_clusters=3,n_init = 15, random_state=2345)
model3.fit(interest_df_z)
preds = model3.predict(interest_df_z)
from sklearn.metrics import silhouette_score
labels = model3.labels_
silhouette_score(interest_df_z, labels, metric='euclidean')
Out[192]:
0.5157182558881063
In [205]:
#Set the value of k=4 
## This step checks the silhouette score using 4 clusters; it confirms that 3 is better.
model4 = KMeans(n_clusters=4, n_init = 15, random_state=2345)
model4.fit(interest_df_z)
preds4= model4.predict(interest_df_z)

labels4 = model4.labels_
silhouette_score(interest_df_z, labels4, metric='euclidean')
Out[205]:
0.48867390173664815
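Rather than checking k=3 and k=4 one at a time, the silhouette comparison can be looped over a whole range of k and the best value picked automatically; a sketch on synthetic blob data standing in for interest_df_z:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups standing in for the customer data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [8, -8]],
                  cluster_std=1.0, random_state=7)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=15, random_state=2345).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(scores, best_k)
```

On the real data this loop would reproduce the 0.516 vs 0.489 comparison above in a single cell.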
In [193]:
centroids = model3.cluster_centers_
centroids
Out[193]:
array([[-0.59579625, -1.05962278, -0.9015185 ,  0.32299678,  1.14810882],
       [-0.02106178,  0.37368962,  0.6663945 , -0.55367163, -0.55300488],
       [ 2.83176409,  1.86222621, -1.10576269,  2.82731942, -0.87432983]])
In [194]:
#Calculate the centroids for the columns to profile
centroid_df = pd.DataFrame(centroids, columns = list(interest_df_z) )
print(centroid_df)
   Avg_Credit_Limit  Total_Credit_Cards  Total_visits_bank  \
0            -0.596              -1.060             -0.902   
1            -0.021               0.374              0.666   
2             2.832               1.862             -1.106   

   Total_visits_online  Total_calls_made  
0                0.323             1.148  
1               -0.554            -0.553  
2                2.827            -0.874  
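The centroids above are expressed in z-score units. Profiling is easier in the original units, which a StandardScaler can recover via `inverse_transform`; a sketch assuming that scaling approach, with synthetic 5-column data standing in for interest_df:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

cols = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank",
        "Total_visits_online", "Total_calls_made"]
# Synthetic stand-in for the unscaled interest_df
X, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)
df = pd.DataFrame(X, columns=cols)

scaler = StandardScaler().fit(df)
km = KMeans(n_clusters=3, n_init=15, random_state=2345).fit(scaler.transform(df))

# Undo the z-scoring so each centroid is expressed in the original units
centroids_orig = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                              columns=cols)
print(centroids_orig.round(2))
```

On the real data this would turn, for example, an Avg_Credit_Limit z-score of 2.83 back into an interpretable dollar figure.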
In [199]:
## creating a new dataframe only for labels and converting it into categorical variable
df_labels = pd.DataFrame(model3.labels_ , columns = list(['labels']))

df_labels['labels'] = df_labels['labels'].astype('category')
In [200]:
# Joining the label dataframe with the data frame.
df_labeled = interest_df.join(df_labels)
In [201]:
df_analysis = (df_labeled.groupby(['labels'], axis=0)).head(4177)  # the groupby creates a grouped
# dataframe that needs to be converted back to a dataframe
df_analysis
Out[201]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made labels
Sl_No
1 100000 2 1 1 0 0
2 50000 3 0 10 9 1
3 50000 7 1 3 4 1
4 30000 5 1 1 4 2
5 100000 6 0 12 3 0
... ... ... ... ... ... ...
656 99000 10 1 10 0 2
657 84000 10 1 13 2 2
658 145000 8 1 9 1 2
659 172000 10 1 15 0 2
660 167000 9 0 12 2 NaN

660 rows × 6 columns

In [202]:
df_labeled['labels'].value_counts()  
Out[202]:
1    385
0    224
2     50
Name: labels, dtype: int64

http://blog.mahler83.net/2019/10/rotating-3d-t-sne-animated-gif-scatterplot-with-matplotlib/

In [165]:
## This code didn't work as expected: the 3D plot was supposed to rotate.
## Likely cause: on matplotlib >= 3.4, Axes3D(fig) no longer attaches the axes
## to the figure automatically; fig.add_subplot(projection='3d') is the safer call.
from mpl_toolkits.mplot3d import axes3d, Axes3D

fig = plt.figure(figsize=(10,10))

ax = fig.add_subplot(111, projection='3d')



x = interest_df.Total_visits_bank
y = interest_df.Total_visits_online
z = interest_df.Total_calls_made


g = ax.scatter(x, y, z, c=x, marker='o', depthshade=False, cmap='Paired')
ax.set_xlabel('Total Bank Visits')
ax.set_ylabel('Total Visits Online')
ax.set_zlabel('Total Calls Made')

# produce a legend with the unique colors from the scatter
legend = ax.legend(*g.legend_elements(), loc="lower center", title="Total bank visits", borderaxespad=-10, ncol=4)
ax.add_artist(legend)

# plt.show()

from matplotlib import animation

def rotate(angle):
     ax.view_init(azim=angle)

angle = 1
ani = animation.FuncAnimation(fig, rotate, frames=np.arange(0, 360, angle), interval=1)
ani.save('Cluster_plot.gif', writer=animation.PillowWriter(fps=25));
  • ## 3.2 Analyse clusters using boxplot
In [227]:
# K = 3 
final_model=KMeans(n_clusters=3)
final_model.fit(interest_df_z)
prediction=final_model.predict(interest_df_z)

#Append the prediction 
interest_df["GROUP"] = prediction #adding the predictions to the unscaled data
interest_df_z["GROUP"] = prediction #adding the predictions to the scaled data
print("Groups Assigned : \n")
interest_df_z
Groups Assigned : 

Out[227]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made GROUP
Sl_No
1 1.740 -1.249 -0.860 -0.547 -1.252 0
2 0.410 -0.788 -1.474 2.521 1.892 1
3 0.410 1.059 -0.860 0.134 0.146 0
4 -0.122 0.136 -0.860 -0.547 0.146 0
5 1.740 0.597 -1.474 3.202 -0.204 2
... ... ... ... ... ... ...
656 1.714 2.444 -0.860 2.521 -1.252 2
657 1.315 2.444 -0.860 3.543 -0.553 2
658 2.937 1.521 -0.860 2.180 -0.902 2
659 3.655 2.444 -0.860 4.225 -1.252 2
660 3.522 1.982 -1.474 3.202 -0.553 2

660 rows × 6 columns

In [208]:
interest_df.groupby("GROUP").count()
Out[208]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
GROUP
0 225 225 225 225 225
1 385 385 385 385 385
2 50 50 50 50 50

Insight # 11

  • From the table above we can see that group 1 has the highest number of clients and represents more than 50% of all the clients in the study
  • Group 2 is the one with the fewest clients, but we can guess that it represents a select group of people. We will see this in the next cells.
In [209]:
interest_df_z.boxplot(by = 'GROUP',  layout=(2,3), figsize=(20, 15))
Out[209]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000024A265EB4C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A23EB5160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A2A8FEAF0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000024A259292B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A256D6DF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A26181190>]],
      dtype=object)

Insight # 12

From the box plot above we can describe each group as follows:

  • group 0 = lowest credit limit, fewest credit cards, most calls, bank visits similar to or more than group 2, 2nd in online visits
  • group 1 = 2nd in credit limit, 2nd in credit cards, 2nd in calls, highest number of bank visits, fewest online visits
  • group 2 = highest credit limit, most credit cards, fewest calls, bank visits similar to or fewer than group 0, most online visits

There is a positive correlation between average credit limit and number of credit cards, and a negative correlation of both with the total calls made; that is, the higher the credit limit and card count, the fewer calls made to the bank.

Bank visits are inverse to online visits: group 1, which visits the bank the most, uses the online channel the least. This group seems to prefer more personal contact with the bank.
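The correlation claims above can be checked directly with `DataFrame.corr()`. A minimal sketch on a synthetic frame deliberately built to mimic the described pattern (the data below is illustrative, not the bank's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in built to mimic the pattern described above
limit = rng.normal(50000, 20000, n)
cards = 2 + limit / 30000 + rng.normal(0, 0.5, n)  # rises with the credit limit
calls = 8 - limit / 25000 + rng.normal(0, 0.5, n)  # falls as the credit limit rises
df = pd.DataFrame({"Avg_Credit_Limit": limit,
                   "Total_Credit_Cards": cards,
                   "Total_calls_made": calls})

corr = df.corr()
print(corr.round(2))
```

Running the same one-liner on the real interest_df would quantify how strong the positive limit/cards and negative limit/calls relationships actually are.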

  • ## 4. Hierarchical Clustering
In [228]:
interest_df_z.drop ("GROUP", inplace=True, axis=1) # Deleting the colum added before
print (interest_df_z.shape)
interest_df_z.head()
(660, 5)
Out[228]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
1 1.740 -1.249 -0.860 -0.547 -1.252
2 0.410 -0.788 -1.474 2.521 1.892
3 0.410 1.059 -0.860 0.134 0.146
4 -0.122 0.136 -0.860 -0.547 0.146
5 1.740 0.597 -1.474 3.202 -0.204
In [213]:
#Use ward as the linkage method and Euclidean as the distance metric
In [229]:
#### generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(interest_df_z, 'ward', metric='euclidean')
Z.shape
Out[229]:
(659, 4)
In [216]:
Z[:]
Out[216]:
array([[ 464.        ,  497.        ,    0.        ,    2.        ],
       [ 425.        ,  455.        ,    0.        ,    2.        ],
       [ 250.        ,  361.        ,    0.        ,    2.        ],
       ...,
       [1313.        , 1314.        ,   16.84480374,  385.        ],
       [1311.        , 1316.        ,   47.06715339,  435.        ],
       [1315.        , 1317.        ,   50.16298666,  660.        ]])
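The assignment also asks for the cophenetic correlation coefficient, which measures how faithfully a linkage tree preserves the original pairwise distances (closer to 1 is better). A sketch comparing several linkages, using synthetic blob data as a stand-in for interest_df_z:

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, n_features=5, random_state=1)
dists = pdist(X)  # condensed matrix of pairwise Euclidean distances

coph = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method)
    # Correlation between the cophenetic distances implied by the tree
    # and the original pairwise distances
    c, _ = cophenet(Z, dists)
    coph[method] = c

print({m: round(c, 3) for m, c in coph.items()})
```

On the real data, the linkage with the highest coefficient is a reasonable default, though ward often gives the most balanced clusters even when its coefficient is not the top one.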

Plot the dendrogram for the consolidated dataframe

In [217]:
plt.figure(figsize=(25, 10))
dendrogram(Z)
plt.show()

From the truncated dendrogram, find the optimal distance between clusters that you want to use as input for clustering the data

In [218]:
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,  # show only the last p merged clusters
)
plt.show()
In [233]:
max_d = 20

Use this distance measure (max_d) and the fcluster function to cluster the data into 3 different groups

In [234]:
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[234]:
array([3, 1, 3, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
      dtype=int32)
In [235]:
# Calculate Avg Silhoutte Score
#from sklearn.metrics import silhouette_score
silhouette_score(interest_df_z,clusters)
Out[235]:
0.5147639589977819

The silhouette score is better when closer to 1 and worse when closer to -1.

Here it is good, but more study is needed to see whether it can be improved. The value is similar to the K-means score for k=3.

Final dendrogram with 'ward' linkage

In [240]:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
plt.figure(figsize=(18, 16))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
Z = linkage(interest_df_z, 'ward')
dendrogram(Z,leaf_rotation=90.0,p=5,color_threshold=30, leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
  • ## 4.1 Analyse clusters using boxplot

Since we didn't cover the boxplot for hierarchical clusters in the mentored session, I searched the internet for a way to create the boxplot, compared it with step 3.2, and arrived at what is shown below

In [247]:
from sklearn.cluster import AgglomerativeClustering
model5=AgglomerativeClustering(n_clusters=3, affinity='euclidean',  linkage='average')
model5.fit(interest_df_z)
prediction2=model5.labels_
interest_df['Group2'] = prediction2 #to differentiate from the GROUP created in the k-means part
interest_df_z['Group2'] = prediction2

interest_df.groupby('Group2').count()
Out[247]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made GROUP
Group2
0 387 387 387 387 387 387
1 50 50 50 50 50 50
2 223 223 223 223 223 223
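Task 6 asks for a direct comparison of the two methods. A contingency table of the two label vectors shows how the clusters line up; a sketch on synthetic data, with `prediction` standing in for the K-means labels and `prediction2` for the hierarchical ones:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Three well-separated 5-D synthetic groups standing in for interest_df_z
X, _ = make_blobs(n_samples=200, centers=[[0] * 5, [8] * 5, [-8] * 5],
                  cluster_std=1.0, random_state=3)

prediction = KMeans(n_clusters=3, n_init=15, random_state=2345).fit_predict(X)
prediction2 = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

# Rows: K-means labels; columns: hierarchical labels. A table that is
# near-diagonal up to a relabelling means the two methods largely agree.
ct = pd.crosstab(pd.Series(prediction, name='kmeans'),
                 pd.Series(prediction2, name='hierarchical'))
print(ct)
```

On the real data this table would make the 385/224/50 vs 387/50/223 correspondence between the two methods explicit.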

Insight # 13

  • The table above has numbers similar to those observed in Insight 11, so the observation is similar.
  • We can see that group 0 has the highest number of clients and represents more than 50% of all the clients in the study
  • Group 1 is the one with the fewest clients, but represents a select group of people with more economic power.
In [248]:
interest_df_z.boxplot(by = 'Group2',  layout=(2,3), figsize=(20, 15))
Out[248]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000024A36C7E9D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A36CCFA30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A36CECA00>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000024A39E288B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A39E54760>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000024A39E7FAF0>]],
      dtype=object)

Insight # 14

From the box plot above we can describe each group as follows:

  • group 0 = 2nd in credit limit, 2nd in credit cards, 2nd in calls, highest bank visits, fewest online visits
  • group 1 = highest credit limit, most credit cards, fewest calls, fewest bank visits, most online visits
  • group 2 = lowest credit limit, fewest credit cards, most calls, bank visits similar to or more than group 1, 2nd in online visits

This box plot is very similar to the one observed in Insight 12:

There is a positive correlation between average credit limit and number of credit cards, and a negative correlation of both with the total calls made.

Bank visits are inverse to online visits.

  • K-means (k=3) silhouette score: 0.5157182558881063
  • Hierarchical clustering silhouette score: 0.5147639589977819

For this exercise, with the parameters used, the results of both techniques were similar. K-means gave a slightly better score, but overall the relationships between the groups were the same.

At the time of this report I did not initially understand why the numbering of the groups changed from one technique to the other (group 2 in K-means is equivalent to group 1 in hierarchical clustering); cluster labels are arbitrary identifiers, so the same segment can receive a different number under each method.
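Because cluster numbers carry no meaning on their own, agreement between the two methods is better measured with a permutation-invariant metric such as the adjusted Rand index; a minimal sketch with toy label vectors standing in for the K-means and hierarchical assignments:

```python
from sklearn.metrics import adjusted_rand_score

# Same partition, different numbering: e.g. group 2 in one labelling is group 1 in the other
kmeans_labels = [0, 0, 1, 1, 2, 2]
hierarchical_labels = [2, 2, 0, 0, 1, 1]

# ARI ignores the label names and scores only the grouping itself;
# identical partitions score exactly 1.0
print(adjusted_rand_score(kmeans_labels, hierarchical_labels))
```

Applied to the real `prediction` and `prediction2` vectors, a value near 1 would confirm the two methods found essentially the same segments despite the renumbering.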

What I can see up to this point is that the hierarchical clustering technique makes it easier to study the number of groups, and for me it is also better from a visualization point of view thanks to the dendrogram.

  • With both techniques we obtained 3 groups, and in my view they can represent economic importance to the bank. This is because credit card limit and number of credit cards are key drivers of the results and are positively correlated, meaning those customers generate a volume of spending that is attractive to the bank.

2- How are these segments different from each other?

  • There are 3 clear differences between the groups:
    • 1- The amount of spending, reflected in the credit limit and number of credit cards.
    • 2- The way the groups communicate with the bank: by phone or by visiting the bank. These have a negative correlation, as seen in the histograms and in Insights 11 and 13.
    • 3- The number of visits to the bank's webpage. This is interesting because the group with the most online visits is the group with the fewest clients, representing less than 10%. However, it would be interesting to compare the money each group moves, to know how important it is to invest in the webpage, in the branch facilities, or in phone services.

3- What are your recommendations to the bank on how to better market to and service these customers?

Given the 3 groups and the differences between them mentioned above, the next step is to evaluate the impact of each group on the economics of the company, in this case a bank. The biggest group, with more than 50% of the clients, is the most active in visiting the bank and 2nd in number of credit cards, but the amount it spends may not be as attractive to the bank as the amounts managed by the smallest group. The marketing team has the option of improving the phone service, the person-to-person service in the branch, or the webpage. Priority should be given to the clients most attractive to the bank.

Without the full economic profile of the clients, I would suggest starting with online promotion, since the smallest group, which manages the highest credit limits, is the most active on the online channel.
